Published August 11, 2022
Lead Scientific Software Developer and Researcher Igor Sfiligoi, of the San Diego Supercomputer Center (SDSC) at UC San Diego, presented three short papers at the 2022 Practice and Experience in Advanced Research Computing (PEARC22) conference, which was held last month in Boston, Massachusetts.
Sfiligoi has been deeply involved in Graphics Processing Unit (GPU)-accelerated computing, which helped him realize that there are significant issues in how GPU computation is usually accounted for. He said that the issues have only grown larger as GPUs become faster and as fractional use of GPUs grows in popularity. His PEARC22 paper, The Anachronism of Whole-GPU Accounting, provided experimental data and a detailed analysis of the problem.
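To see why whole-GPU accounting can mislead, consider a toy comparison between charging by wallclock GPU allocation and weighting by actual utilization. The jobs, numbers, and the utilization-weighted metric below are purely illustrative assumptions, not data or methods from the paper; this is only a minimal sketch of the accounting gap:

```python
# Toy comparison (illustrative numbers only): naive whole-GPU accounting
# charges a job one full GPU for its entire wallclock time, no matter how
# much of the device it actually used.

jobs = [
    # (description, wallclock_hours, average_gpu_utilization)
    ("well-tuned GPU job",     10.0, 0.90),
    ("mostly CPU-bound job",   10.0, 0.15),
    ("job on a 1/4 GPU share", 10.0, 0.25),
]

for name, hours, util in jobs:
    whole_gpu_hours = hours           # whole-GPU accounting: 1 GPU x wallclock
    effective_hours = hours * util    # utilization-weighted alternative
    print(f"{name:22s} charged {whole_gpu_hours:5.1f} GPU-h, "
          f"actually used ~{effective_hours:4.1f} GPU-h")
```

Under whole-GPU accounting all three hypothetical jobs are charged identically, even though their real GPU consumption differs by several times.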
“As GPU computing becomes mainstream,” Sfiligoi said, “it is time we stop treating it as a special case and apply available best practices instead.”
Sfiligoi explained that the SDSC-affiliated Pacific Research Platform (PRP) is a major GPU resource provider in the academic community. “Its Kubernetes-based approach has proven very successful, but not everyone is willing to adopt a Kubernetes-first approach,” he said.
One such community is the National Science Foundation-funded IceCube experiment, which relies on the HTCondor-based OSG software stack for most of its simulation needs.
Sfiligoi’s PEARC22 paper Auto-scaling HTCondor Pools Using Kubernetes Compute Resources described the “glue” used in the PRP to bridge the two approaches.
“Kubernetes is a great system for managing your resources,” Sfiligoi said, “but HTCondor is much better at spreading the load over many data centers. Having a way to bridge the two is thus very important.”
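The paper describes the actual PRP mechanism; as a rough illustration of the general idea only, the following minimal sketch (with a hypothetical Deployment name, namespace, and scaling policy) grows or shrinks a pool of HTCondor execute-node pods to match the number of idle jobs in the queue:

```python
# Minimal auto-scaling sketch (hypothetical names and policy, not the PRP
# implementation): check the HTCondor queue for idle jobs and scale a
# Kubernetes Deployment of HTCondor execute-node pods to match.

import htcondor                      # HTCondor Python bindings
from kubernetes import client, config

DEPLOYMENT = "htcondor-worker"       # hypothetical Deployment of execute pods
NAMESPACE = "osg-pool"               # hypothetical namespace
MAX_WORKERS = 50                     # arbitrary cap for the example

def desired_workers() -> int:
    """Count idle jobs in the local HTCondor queue (JobStatus == 1 means idle)."""
    schedd = htcondor.Schedd()
    idle_jobs = schedd.query(constraint="JobStatus == 1",
                             projection=["ClusterId", "ProcId"])
    return min(len(idle_jobs), MAX_WORKERS)

def scale_workers(replicas: int) -> None:
    """Set the replica count of the worker Deployment."""
    config.load_kube_config()        # use load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": replicas}})

if __name__ == "__main__":
    scale_workers(desired_workers())
```

A production version would also need to drain idle workers and handle credentials and quotas; this sketch only shows the core scale-to-demand step.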
Sfiligoi has also been involved with an array of fusion science simulation projects. As with many compute-intensive science domains, fusion scientists tend to make heavy use of GPUs. The challenge for these users is that their simulations do not partition neatly among individual compute hosts, so networking becomes a very important part of the overall system. In his paper Comparing Single-Node and Multi-Node Performance of an Important Fusion HPC Code Benchmark, Sfiligoi provided a fusion-relevant comparison among several of the best systems currently available.
“As compute nodes become increasingly faster, we must ensure that networking keeps pace or we will not be able to make good use of them,” he said. “For the past decade or so, that has not been the case.”
Sfiligoi also recently authored a Google Cloud article titled Using Google Kubernetes Engine’s GPU sharing to search for neutrinos, which discussed how GPU sharing in Google Kubernetes Engine helped researchers detect neutrinos at the South Pole with the gigaton-scale IceCube Neutrino Observatory.
Sfiligoi, who is also Senior Research Scientist for Distributed High-Throughput Computing at SDSC, works on a variety of projects in collaboration with SDSC personnel on the PRP and the Open Science Grid (OSG), as well as with several external collaborators, including University of Wisconsin-Madison personnel on the IceCube experiment and General Atomics personnel on fusion science.